Source of this sort of analysis: https://www.gapminder.org/
10/16/2020
Syllabus and expectations
Counting and how we will proceed
The many faces of data
Pilots should love to know the defects in their planes. Knowing where a plane is unreliable is exactly the information needed to make difficult decisions, like landing in bad weather.
Organizations that have good rainy day funds are more valuable than those that use their cash to buy back stock. When the rain arrives, and it always does, cash will keep jobs and businesses afloat.
Probability does not exist; neither do means and standard deviations.
Any distribution of anything that can be described only by means and standard deviations is (plausibly) the least informative.
Why do cooks eat their own food (or should)? If the cook fails, the cook suffers and may die, thus eliminating very dangerous people from the planet. (a saying of Taleb)
Do until (days left in the semester == 0) OR (you have completed your learning of a topic)
Prepare for each week through reading, research, practice, and marching through the videos
Live sessions to reinforce key topics and the solving of posed problems
Questions and answers and follow-up on THE WALL for the next round
| step | statement |
|---|---|
| major | if A is true, then B is true |
| minor | B is true |
| conclusion | thus, A becomes more plausible |
| P | Q | major: if P then Q | minor: P | conclusion: Q |
|---|---|---|---|---|
| TRUE | TRUE | TRUE | TRUE | TRUE |
| TRUE | FALSE | FALSE | TRUE | FALSE |
| FALSE | TRUE | TRUE | FALSE | TRUE |
| FALSE | FALSE | TRUE | FALSE | FALSE |
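The truth table above can be reproduced mechanically. This is a minimal sketch in base R, assuming the usual material-implication reading of "if P then Q" as `!P | Q`; the name `truth_table` is introduced here only for illustration.

```r
# Enumerate every combination of truth values, ordered TT, TF, FT, FF
truth_table <- expand.grid(Q = c(TRUE, FALSE), P = c(TRUE, FALSE))[, c("P", "Q")]

# "if P then Q" is false only when P is true and Q is false
truth_table$if_P_then_Q <- !truth_table$P | truth_table$Q

print(truth_table)
```

Only the second row (P true, Q false) violates the major premise, which is why observing P while accepting the premise forces Q.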
| step | statement |
|---|---|
| major | if A is true, then B is true |
| minor | A is false |
| conclusion | thus, B becomes less plausible |
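Both plausibility arguments (the one concluding that A becomes more plausible and the one concluding that B becomes less plausible) can be checked numerically with Bayes' rule. The sketch below is illustrative only: the major premise fixes P(B | A) = 1, while the prior for A and the value of P(B | not A) are made-up numbers chosen for the example.

```r
p_A            <- 0.3  # assumed prior plausibility of A (illustrative)
p_B_given_A    <- 1    # major premise: if A is true, then B is true
p_B_given_notA <- 0.4  # assumed plausibility of B when A is false (illustrative)

# Overall plausibility of B before learning anything about A
p_B <- p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Minor premise "B is true": update A with Bayes' rule
p_A_given_B <- p_B_given_A * p_A / p_B

# A rises from its prior once B is observed
print(c(prior_A = p_A, A_given_B = p_A_given_B))
# B falls from its prior once A is known to be false
print(c(prior_B = p_B, B_given_notA = p_B_given_notA))
```

With these numbers A moves from 0.30 to about 0.52 after B is observed, and B moves from 0.58 down to 0.40 once A is known to be false. Neither conclusion is certain; each is only a shift in plausibility.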
Concoct a data story
Condition the story with data observations
Critique the conditioned data story
We know there are positive and negative cases of a new virus in 4 zip codes.
Three data collectors observe at random and independently a positive, then a negative, then another positive zip code. The sites might all have the same zips or not.
We ask: how many of the 4 zip codes test positive?
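The counting behind this question can be sketched in a few lines of base R. The sketch assumes each collector picks one of the 4 zip codes at random and with replacement (one reading of "the sites might all have the same zips or not"), so a conjecture with k positive zip codes can produce a positive observation in k ways and a negative one in 4 - k ways. The variable names are introduced here for illustration.

```r
n_zips    <- 4
positives <- 0:n_zips  # five conjectures: 0, 1, 2, 3 or 4 positive zip codes

# Observed sequence: positive, negative, positive
# ways = (ways to see +) * (ways to see -) * (ways to see +)
ways <- positives * (n_zips - positives) * positives

# Plausibility: ways normed by their sum
plausibility <- ways / sum(ways)

print(data.frame(positives, ways, plausibility))
```

Under these assumptions the conjectures 0 and 4 are ruled out (zero ways), and 3 positive zip codes is the most plausible at 9 of 20 ways.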
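library(tidyverse)
library(rethinking)
n <- 1000       # number of grid points for the proportion
n_success <- 6  # observed successes
n_trials <- 8   # number of trials
# the outer parentheses print the tibble after assigning it
(
  binomial_model <-
    tibble(p_grid = seq(from = 0, to = 1, length.out = n),
           # note we're still using a flat uniform prior
           prior = 1) %>%
    mutate(likelihood = dbinom(n_success, size = n_trials, prob = p_grid)) %>%
    # normalize so the posterior sums to 1 over the grid
    mutate(posterior = (likelihood * prior) / sum(likelihood * prior))
)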
## # A tibble: 1,000 x 4
##     p_grid prior likelihood posterior
##      <dbl> <dbl>      <dbl>     <dbl>
##  1 0           1   0.        0.
##  2 0.00100     1   2.81e-17  2.53e-19
##  3 0.00200     1   1.80e-15  1.62e-17
##  4 0.00300     1   2.04e-14  1.84e-16
##  5 0.00400     1   1.14e-13  1.03e-15
##  6 0.00501     1   4.36e-13  3.93e-15
##  7 0.00601     1   1.30e-12  1.17e-14
##  8 0.00701     1   3.27e-12  2.94e-14
##  9 0.00801     1   7.27e-12  6.55e-14
## 10 0.00901     1   1.47e-11  1.32e-13
## # ... with 990 more rows
summary( binomial_model )
##      p_grid         prior     likelihood         posterior
##  Min.   :0.00   Min.   :1   Min.   :0.000000   Min.   :0.000e+00
##  1st Qu.:0.25   1st Qu.:1   1st Qu.:0.003022   1st Qu.:2.722e-05
##  Median :0.50   Median :1   Median :0.064970   Median :5.853e-04
##  Mean   :0.50   Mean   :1   Mean   :0.111000   Mean   :1.000e-03
##  3rd Qu.:0.75   3rd Qu.:1   3rd Qu.:0.220190   3rd Qu.:1.984e-03
##  Max.   :1.00   Max.   :1   Max.   :0.311462   Max.   :2.806e-03
library(tidybayes) # Mode() helper function
library(plotly)    # make the plot interactive
# how many samples would you like
n_samples <- 10000 # 1e4
# make it reproducible
set.seed(42) # Hitchhiker's Guide
# draw grid values in proportion to their posterior weight
samples <-
  binomial_model %>%
  sample_n(size = n_samples, weight = posterior, replace = TRUE)
p_MAP <- Mode(samples$p_grid) # MAximum A Posteriori point estimate
title <- "Bronx Zip Code Tests"
x_label <- "proportion of zip codes testing positive"
y_label <- "posterior density"
plt <- samples %>%
  ggplot(aes(x = p_grid)) +
  geom_density(fill = "blue", alpha = 0.3) +
  scale_x_continuous(x_label, limits = c(0, 1)) +
  geom_vline(xintercept = p_MAP, color = "orange", size = 1.3) +
  annotate("text", x = 0.50, y = 2, label = paste0("MAP = ", round(p_MAP, 4))) +
  ylab(y_label) + xlab(x_label) +
  ggtitle(title)
# ggplotly(plt) # uncomment to see the interactive plot
We set up a context, a story that has data associated with it:
4 zip codes
Collected 3 observations in various zip codes: positive, negative, positive
5 hypotheses, theories, models conjectured
Conditioned the models with observations
Counted (finally admitting this!) the ways a model is consistent with data
Consistency means asking: can the data imply the model? (LOGIC!)
We analyzed the ways and found the plausibility of each model; we then might have been forced by our employer to select the most plausible theory
Probability, Likelihood, is Plausibility
Ways normed by their sum
Only one theory is most probable: sample, sample, sample
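The "sample, sample, sample" step can be illustrated on the five zip-code conjectures themselves. This is a hedged sketch in base R, not the course's own code: it rebuilds the ways under the same with-replacement assumption (positive, negative, positive observations from 4 zip codes), then draws conjectures in proportion to their plausibility. The seed and the number of draws are arbitrary choices.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

positives    <- 0:4                                 # the five conjectures
ways         <- positives * (4 - positives) * positives
plausibility <- ways / sum(ways)                    # ways normed by their sum

# Draw conjectures in proportion to their plausibility
draws <- sample(positives, size = 10000, replace = TRUE, prob = plausibility)

# Relative frequency of each conjecture among the draws
freq <- table(factor(draws, levels = positives)) / length(draws)
print(round(freq, 3))

# The conjecture drawn most often approximates the most plausible one
most_plausible <- positives[which.max(freq)]
print(most_plausible)
```

The sampled frequencies approximate the plausibilities, so the most frequently drawn conjecture points to the single most probable theory.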
Blue and red
Plausibility
Direction
Time
Troop strength
Temperature
Geo-coordinates
A hierarchy is an analytical technique that takes a group of objects and asks two questions:
How are the objects (nodes) related to one another (edges)? (just a network)
What objects are parents (higher level) or children (lower level) of one another
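The two questions can be made concrete with an edge list. The sketch below uses base R and an entirely hypothetical hierarchy (a city, two boroughs, three zip codes); the node names and the helper functions `children_of()` and `parent_of()` are invented for illustration.

```r
# Hypothetical edge list: each row is one edge from a parent to a child
edges <- data.frame(
  parent = c("city", "city", "borough_A", "borough_A", "borough_B"),
  child  = c("borough_A", "borough_B", "zip_1", "zip_2", "zip_3"),
  stringsAsFactors = FALSE
)

# Question 1: which objects (nodes) exist, and how are they related (edges)?
nodes <- union(edges$parent, edges$child)

# Question 2: who is a parent or a child of whom?
children_of <- function(node) edges$child[edges$parent == node]
parent_of   <- function(node) edges$parent[edges$child == node]

roots  <- setdiff(nodes, edges$child)   # nodes that have no parent
leaves <- setdiff(nodes, edges$parent)  # nodes that have no children

print(children_of("borough_A"))
print(parent_of("zip_3"))
print(roots)
print(leaves)
```

The edge list alone is just a network; the hierarchy appears once each edge is read as pointing from a higher level to a lower one.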
NYC Open Data has NYPD complaint data for the past few months at https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Current-YTD/5uac-w243/data
A simple scatter plot of latitude (lat) against longitude (lon) will show light and dark areas, that is, sparse and dense clusters of complaints, for us to inspect in the data.
## # A tibble: 6 x 4
##   CMPLNT_FR_DT PD_DESC                            lat   lon
##   <chr>        <chr>                            <dbl> <dbl>
## 1 12/10/2016   FRAUD,UNCLASSIFIED-FELONY         40.9 -73.9
## 2 12/3/2016    LARCENY,PETIT FROM AUTO           40.9 -73.9
## 3 11/16/2016   CRIMINAL MISCHIEF,UNCLASSIFIED 4  40.9 -73.8
## 4 10/27/2016   LARCENY,PETIT FROM OPEN AREAS,    40.9 -73.8
## 5 1/1/2016     LEAVING SCENE-ACCIDENT-PERSONA    40.8 -73.9
## 6 12/1/2015    FRAUD,UNCLASSIFIED-FELONY         40.9 -73.9
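A minimal sketch of the lat/lon scatter plot, in base R so it runs without extra packages. The data frame `complaints` is typed in by hand from the six rows printed above, with coordinates rounded as displayed; it stands in for the full download, which is not reproduced here. Semi-transparent points are one simple way to make dense areas look darker.

```r
# Six rows copied from the printed tibble (coordinates as displayed, rounded)
complaints <- data.frame(
  CMPLNT_FR_DT = c("12/10/2016", "12/3/2016", "11/16/2016",
                   "10/27/2016", "1/1/2016", "12/1/2015"),
  lat = c(40.9, 40.9, 40.9, 40.9, 40.8, 40.9),
  lon = c(-73.9, -73.9, -73.8, -73.8, -73.9, -73.9)
)

# Semi-transparent points: overlapping complaints render darker
plot(complaints$lon, complaints$lat,
     pch = 16, col = rgb(0, 0, 1, alpha = 0.3),
     xlab = "lon", ylab = "lat", main = "NYPD complaints")

# The same light/dark information as counts per coordinate cell
cell_counts <- table(lat = complaints$lat, lon = complaints$lon)
print(cell_counts)
```

With the full data set the rounded coordinates would be replaced by the raw values, and the dark regions of the plot would mark where complaints concentrate.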
A procedure of story, conditioning, choice
The many ways we can represent our physical reality with data, models, and plausibility
Contingency tables
Logic and the Reverend Bayes
Check MOODLE for each week’s activities and requirements
In this fully remote course: every day is a class day!